Car price analysis results

Make

Graphical Comparison

A graphical review of log-price by make indicates most brands occupy some narrow portion of the price spectrum. A few brands stand out with relatively large pricing ranges (BMW, Mazda, Nissan).

No brands appear to be obvious pricing outliers, however a look at the number of samples per brand shows several brands with small portfolios. These brands will be removed from the subsequent analysis: Renault, Isuzu, Mercury, Chevrolet, Alfa Romeo, Porsche, and Jaguar

ANOVA Analysis

The ANOVA analysis of car makes returns a p-value < 0.05 and shows no abnormalities in normality, indicating significant differences in the mean prices of some brands.

Moving on to a more detailed pairwise Tukey analysis, we see statistically significant differences between the means of multiple brands. The largest differences occur between luxury brands like Mercedes, BMW and Volvo contrasted against more economic brands like Dodge, Honda, Mazda, and Plymouth.

Note: Due to the large number of combinations of makes (105), only significant Tukey results are presented below.

[1] "ANOVA Results:"
             Df Sum Sq Mean Sq F value Pr(>F)    
make         14  26.94  1.9240   23.82 <2e-16 ***
Residuals   168  13.57  0.0808                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"


   diff      lwr      upr   p.adj  comparison               
-------  -------  -------  ------  -------------------------
 -0.834   -1.350   -0.319   0.000  dodge-audi               
 -0.793   -1.276   -0.310   0.000  honda-audi               
 -0.566   -1.031   -0.102   0.004  mazda-audi               
  0.631    0.103    1.159   0.005  mercedes-benz-audi       
 -0.691   -1.174   -0.208   0.000  mitsubishi-audi          
 -0.602   -1.063   -0.141   0.001  nissan-audi              
 -0.827   -1.371   -0.282   0.000  plymouth-audi            
 -0.747   -1.237   -0.258   0.000  subaru-audi              
 -0.622   -1.057   -0.187   0.000  toyota-audi              
 -0.578   -1.067   -0.089   0.006  volkswagen-audi          
 -1.176   -1.652   -0.701   0.000  dodge-bmw                
 -1.134   -1.574   -0.695   0.000  honda-bmw                
 -0.908   -1.328   -0.489   0.000  mazda-bmw                
 -1.033   -1.473   -0.593   0.000  mitsubishi-bmw           
 -0.944   -1.360   -0.528   0.000  nissan-bmw               
 -0.480   -0.934   -0.025   0.028  peugot-bmw               
 -1.169   -1.675   -0.662   0.000  plymouth-bmw             
 -1.089   -1.536   -0.643   0.000  subaru-bmw               
 -0.964   -1.351   -0.577   0.000  toyota-bmw               
 -0.920   -1.366   -0.473   0.000  volkswagen-bmw           
  1.465    0.990    1.941   0.000  mercedes-benz-dodge      
  0.697    0.257    1.136   0.000  peugot-dodge             
  0.674    0.159    1.190   0.001  saab-dodge               
  0.844    0.405    1.284   0.000  volvo-dodge              
  1.424    0.984    1.863   0.000  mercedes-benz-honda      
  0.655    0.254    1.056   0.000  peugot-honda             
  0.633    0.150    1.116   0.001  saab-honda               
  0.803    0.402    1.204   0.000  volvo-honda              
  1.197    0.778    1.617   0.000  mercedes-benz-mazda      
  0.429    0.050    0.807   0.011  peugot-mazda             
  0.577    0.198    0.955   0.000  volvo-mazda              
 -1.322   -1.762   -0.883   0.000  mitsubishi-mercedes-benz 
 -1.233   -1.649   -0.817   0.000  nissan-mercedes-benz     
 -0.769   -1.223   -0.314   0.000  peugot-mercedes-benz     
 -1.458   -1.964   -0.951   0.000  plymouth-mercedes-benz   
 -0.791   -1.319   -0.263   0.000  saab-mercedes-benz       
 -1.378   -1.825   -0.932   0.000  subaru-mercedes-benz     
 -1.253   -1.640   -0.866   0.000  toyota-mercedes-benz     
 -1.209   -1.656   -0.762   0.000  volkswagen-mercedes-benz 
 -0.621   -1.076   -0.166   0.001  volvo-mercedes-benz      
  0.554    0.153    0.954   0.000  peugot-mitsubishi        
  0.531    0.048    1.014   0.017  saab-mitsubishi          
  0.701    0.301    1.102   0.000  volvo-mitsubishi         
  0.464    0.090    0.839   0.003  peugot-nissan            
  0.612    0.238    0.987   0.000  volvo-nissan             
 -0.689   -1.162   -0.216   0.000  plymouth-peugot          
 -0.610   -1.018   -0.201   0.000  subaru-peugot            
 -0.484   -0.826   -0.142   0.000  toyota-peugot            
 -0.440   -0.849   -0.032   0.022  volkswagen-peugot        
  0.667    0.122    1.211   0.004  saab-plymouth            
  0.837    0.364    1.310   0.000  volvo-plymouth           
 -0.587   -1.077   -0.098   0.005  subaru-saab              
 -0.462   -0.897   -0.027   0.026  toyota-saab              
  0.758    0.349    1.166   0.000  volvo-subaru             
  0.632    0.290    0.974   0.000  volvo-toyota             
  0.588    0.180    0.997   0.000  volvo-volkswagen         

Bootstrap Means

In this section, we generate bootstrapped sample means for each brand and analyze their differences. Individual bootstrapped means look similar to the initial box plots.

Comparison of bootstrapped means returns similar results, with comparisons between expensive brands like Mercedes, BMW and Volvo against Dodge, Mazda, Plymouth and Honda dominating the tails.

Note: Histograms were collapsed into point-range plots for this analysis due to the large number of makes.

Body Style

Graphical Comparison

A cursory glance at the differences between body styles reveals generous overlap between the inter-quartile ranges of the various body styles. Looking at the number of samples per body style, we see that convertible and hardtop are relatively rare, and will not be included in subsequent analysis.

ANOVA Analysis

The ANOVA returns a p-value < 0.05, indicating some body styles command different mean prices. It must be noted that the Q-Q plot shows some non-normality in the lower price quartiles.

Deeper analysis of the Tukey ANOVA reveals a significant difference between the prices of hatchbacks and sedans.

[1] "ANOVA Results:"
             Df Sum Sq Mean Sq F value   Pr(>F)    
body.style    2   3.85  1.9264   9.274 0.000145 ***
Residuals   184  38.22  0.2077                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[1] "Tukey Results:"


   diff      lwr     upr   p.adj  comparison      
-------  -------  ------  ------  ----------------
  0.311    0.139   0.482   0.000  sedan-hatchback 
  0.223   -0.029   0.475   0.094  wagon-hatchback 
 -0.088   -0.330   0.154   0.668  wagon-sedan     

Bootstrap Means

Comparing the bootstrapped means shows a similar conclusion: the 95% CI on the difference between hatchbacks and sedans is between approximately -0.47 and -0.27. Interestingly, 95% CI on the difference between wagons and hatchbacks also indicates a likely difference in price- this is not consistent with the ANOVA.

Drive Wheels

Graphical Analysis

Graphical analysis of drive wheels seems to indicate RWD cars command a price premium. Though there are relatively few 4WD samples, the sample was still usable.

ANOVA Analysis

The ANOVA analysis of drive wheels indicates that some drive configurations are indeed significantly different. No normality issues are apparent.

The Tukey plots reveal significant differences between RWD-4WD and RWD-FWD. 4WD and FWD are not significantly different.

[1] "ANOVA Results:"
              Df Sum Sq Mean Sq F value Pr(>F)    
drive.wheels   2  23.81  11.903   88.46 <2e-16 ***
Residuals    198  26.64   0.135                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"


   diff      lwr     upr   p.adj  comparison 
-------  -------  ------  ------  -----------
 -0.119   -0.435   0.198   0.649  fwd-4wd    
  0.599    0.276   0.921   0.000  rwd-4wd    
  0.718    0.590   0.845   0.000  rwd-fwd    

Bootstrap

The boostrapped means of drive configurations arrive at the same conclusion- RWD is signigicantly different than both 4WD and FWD, and 4WD-FWD are not significantly different.

Cylinder Count

Graphically, we can see that only four, five, six, and eight cylinder engines have enough samples to analyze. It appears that four cylinder models occupy the low end of the price range, five and six cylinders the mid range, and eight cylinders the high end.

ANOVA Analysis

The ANOVA indeed supports a significant difference between cylinder counts, though some non-normality appears to occur in the upper prices.

The Tukey ANOVA shows that only the five and six cylinder engines are priced similarly. Differences between the four or eight cylinder engines to other configurations are all significant.

[1] "ANOVA Results:"
                  Df Sum Sq Mean Sq F value Pr(>F)    
num.of.cylinders   3  24.81   8.269   67.01 <2e-16 ***
Residuals        191  23.57   0.123                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[1] "Tukey Results:"


   diff      lwr      upr   p.adj  comparison 
-------  -------  -------  ------  -----------
 -0.601   -1.140   -0.062   0.022  five-eight 
 -1.387   -1.848   -0.926   0.000  four-eight 
 -0.554   -1.046   -0.063   0.020  six-eight  
 -0.786   -1.083   -0.489   0.000  four-five  
  0.047   -0.296    0.389   0.985  six-five   
  0.833    0.633    1.032   0.000  six-four   

Bootstrap

The bootstrap analysis arrives at the same result, showing the four cylinder at the lower prices, five and six cylinders in the middle, and eight at the top of the price range.

Engine Type

Graphical Analysis

Graphical analysis shows a difference in price by engine type, though with overlap between many types. The rotor and dohcv engines are rare and will not be included in further analysis.

ANOVA Analysis

The ANOVA indicates significant price differences between some engine types. The Tukey plot highlights these differences between ohcv-ohc, ohcv-ohcf, and ohc-dohc types.

[1] "ANOVA Results:"
             Df Sum Sq Mean Sq F value   Pr(>F)    
engine.type   4  10.04  2.5103   11.96 1.06e-08 ***
Residuals   192  40.28  0.2098                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"


   diff      lwr      upr   p.adj  comparison 
-------  -------  -------  ------  -----------
 -0.181   -0.696    0.333   0.868  l-dohc     
 -0.487   -0.866   -0.108   0.005  ohc-dohc   
 -0.416   -0.905    0.072   0.135  ohcf-dohc  
  0.309   -0.196    0.814   0.447  ohcv-dohc  
 -0.305   -0.684    0.074   0.177  ohc-l      
 -0.235   -0.723    0.254   0.677  ohcf-l     
  0.490   -0.015    0.995   0.062  ohcv-l     
  0.070   -0.272    0.413   0.980  ohcf-ohc   
  0.795    0.430    1.161   0.000  ohcv-ohc   
  0.725    0.247    1.203   0.000  ohcv-ohcf  

Bootstrap

The bootstrap also picks up on differences between ohcv-ohc, ohcv-ohcf, and ohc-dohc types, but also picks up on differences between dohc-ohcf and, interestingly, sees a separation of the mid-range l type against the lower-priced ohc and the high-priced ohcv types.

Source

#### Load libraries ####
library(dplyr)
library(reshape2)
library(ggplot2)
library(lazyeval)
library(HistData)
library(resample)
library(simpleboot)
library(knitr)
library(RColorBrewer)




######################################################################
##  Import and set typing + basic calculated columns
######################################################################

# Import
auto.data <- read.csv('Automobile price data _Raw_.csv', stringsAsFactors = FALSE)

# Set types on factor columns
factor.cols <- c("make", "fuel.type", "aspiration", "num.of.doors",
                 "body.style", "drive.wheels", "engine.location",
                 "engine.type", "num.of.cylinders", "fuel.system")
auto.data[, factor.cols] <- lapply(auto.data[factor.cols], as.factor)

# Set types on numeric columns
numeric.cols <- c("wheel.base", "length", "width", "height",
                  "curb.weight", "engine.size", "bore", "stroke",
                  "compression.ratio", "horsepower", "peak.rpm",
                  "city.mpg", "highway.mpg", "price")
auto.data[, numeric.cols] <- lapply(auto.data[numeric.cols], as.numeric)

# Calculate log-price column
auto.data <- auto.data %>% mutate(log.price = log(price))






######################################################################
##  Define Functions
######################################################################


#### Graphical analysis function ####

f_graph_analysis <- function(data, dependent, independent) {
  ## Description: This function takes a dataset with dependent and independent
  ##              variables, and generates a boxplot with independent variables
  ##              ordered by the median of the dependent. Also plots a bar chart
  ##              of counts of each independent variable level.
  ## Arguments:
  ##    data: A dataframe for plotting
  ##    dependent: character object describing dependent variable column name
  ##    independent: character object describing independent variable column name
  
  # Select columns of interest and remove rows with blanks
  plot.data <- data[, c(dependent, independent)]
  plot.data <- plot.data[complete.cases(plot.data), ]
  
  # Provide generic column names to standardize analysis pipelines
  colnames(plot.data) <- c("dependent", "independent")
  
  # Re-order independent variable factor levels by median price (dependent variable)
  median.plot.data <- aggregate(dependent ~ independent, plot.data, median)
  plot.data$independent <- factor(plot.data$independent,
                                  levels = median.plot.data[order(median.plot.data$dependent), "independent"])
  
  # Build a boxplot of independent-dependent values
  p.boxplot <- ggplot(plot.data, aes(independent, dependent)) +
    geom_boxplot() +
    labs(title = paste("Box plot of ", independent, " vs. ", dependent, sep =),
         x = independent, y = dependent)
  print(p.boxplot)
  
  # Plot number of observations of each level of independent variable
  p.counts <- ggplot(plot.data, aes(independent)) +
    geom_bar() +
    labs(title = paste("Counts of ", independent, sep =""),
         x = independent, y = "count")
  print(p.counts)
  
}








#### Filtering function ####

f_remove_levels <- function(data, target.column, remove.levels = NULL, na.columns = "log.price") {
  ## Description: This function removes any levels of an independent variable that
  ##              we want to eliminate before ANOVA or bootstrap analysis. Also removes
  ##              any rows where the dependent variable has a value of NA.
  ##
  ## Arguments:
  ##    data: input data frame
  ##    target.column: column to massage
  ##    remove.levels: any levels of the target column we want filtered out (e.g. if a level has too few samples to model)
  ##    na.columns: columns to remove NA values from.
  
  # If remove.levels has been defined, build an index of rows in data that contain those 
  # values and remove them
  if (!is.null(remove.levels)) {
    remove.index <- which(data[, target.column] %in% remove.levels)
    data <- data[-remove.index, ]
  }
  
  # Remove rows containing an NA in na.columns
  data <- data[!is.na(data[, na.columns]), ]
  
  # Drop unused factor levels made extinct in the above processing
  data <- droplevels(data)
  
  return(data)
}




## ANOVA function
f_anova_analyses <- function(data, formula, truncate = FALSE, suppress.diag.plot = TRUE) {
  ## Description: Takes dataset and formula and calculates ANOVA and Tukey HSD. 
  ##              Returns ANOVA and Tukey tables and plots.
  ## Arguments:
  ##    data: Dataframe for analysis
  ##    formula: A formula-type object describing the model. ex: formula(y ~ x)
  ##    truncate: Filter printed results to only show significant (p<0.05) results. Use when too many levels are presented to be meaningful.
  ##    suppress.diag.plot: Suppresses output of diagnostic plots from ANOVA. Suppress in reporting when no diagnostic issues are known.
  
  
  # Calculate basic ANOVA
  anova.df <- aov(formula, data = data)
  
  # Print ANOVA table
  print("ANOVA Results:")
  print(summary(anova.df))
  
  # Set a 2x2 plot layout and print diagnostic plots
  # Only executes if suppress.diag.plot == FALSE
  if (suppress.diag.plot == FALSE) {
    layout(matrix(c(1,2,3,4),2,2))
    plot(anova.df)
    layout(matrix(c(1),1,1))
  }
  
  
  # Calculate Tukey ANOVA from ANOVA results
  tukey.df <- TukeyHSD(anova.df)
  
  # Build data frame from Tukey results for manipulation
  tukey.df <- as.data.frame(tukey.df[1])
  colnames(tukey.df) <- c("diff", "lwr", "upr", "p.adj")  # Re-assert names lost in df translation
  tukey.df$comparison <- factor(rownames(tukey.df), levels = row.names(tukey.df)[order(tukey.df$diff)])  # Copy rownames into a column
  row.names(tukey.df) <- NULL  # Remove row names
  tukey.df$significant <- ifelse(tukey.df$p.adj <= 0.05, "yes", "no")  # Flag statistically significant rows
  
  
  # When truncate == TRUE, filter out Tukey results that are not significant
  if (truncate == TRUE) {
    tukey.df <- tukey.df %>% filter(p.adj <= 0.05)
  }
  
  
  # Print Tukey table. Use kable for nicer looking prints
  print("Tukey Results:")
  print(kable(tukey.df[, 1:5], digits = 3))
  
  
  ## Plot results
  
  # Create a custom color scale, mapping colors to statistical significance (tukey.df$significant)
  myColors <- brewer.pal(5,"Set1")
  names(myColors) <- levels(tukey.df$significant)
  colScale <- scale_colour_manual(name = "significant",values = myColors)
  
  # Build ggplot of tukey.df. Note the coord_flip to place names on Y axis, and h-line at zero.
  p.tukey <- ggplot(tukey.df, aes(x = comparison, y = diff, ymin = lwr, ymax = upr)) +
    geom_pointrange(aes(colour = significant)) +
    colScale +
    coord_flip() +
    geom_hline(yintercept = 0) +
    labs(title = "Tukey HSD Results", y = "Difference of log-price means")
  print(p.tukey)
  
}



## Bootstrapping functions
f_multilevel_one.boot <- function(data, independent, dependent, type = "mean", R.count = 10000) {
  ## Description: This function generates bootstrap samples of each level
  ##              of an independent variable (e.g. "ford" and "gm" levels of variable "make").
  ## Arguments:
  ##    data: dataframe
  ##    independent: independent variable column name
  ##    dependent: dependent variable column name
  ##    type: mean or median bootsrap
  ##    R.count: number of bootstrap samples
  
  # Break out target levels
  boot.levels <- levels(data[, independent])
  
  # Bootstrap means/medians of each level of independent variable
  boots <- data.frame()  # Pre-allocate results dataframe
  for (i in 1:length(boot.levels)) {
    current.level.data <- data[which(data[, independent] == boot.levels[i]), ]  # Data for current level of independent variable
    boot.level <- one.boot(current.level.data[, dependent], substitute(type), R = R.count)  # Bootstrap R.count iterations of mean/median of current data
    boots <- rbind(boots, t(boot.level$t))  # Append results to boots dataframe
  }
  
  # Re-orient results and name columns with independent variable levels
  boots <- as.data.frame(t(boots))
  colnames(boots) <- boot.levels
  
  return(boots)
}




f_multilevel_two.boot <- function(data, independent, dependent, type = "mean", R.count = 10000) {
  ## Description: This function generates bootstrap samples of DIFFERENCES between each level
  ##              of an independent variable (e.g. "ford" and "gm" levels of variable "make").
  ## Arguments:
  ##    data: dataframe
  ##    independent: independent variable column name
  ##    dependent: dependent variable column name
  ##    type: mean or median bootsrap
  ##    R.count: number of bootstrap samples
  
  
  # List out all levels of independent variable
  boot.levels <- levels(data[, independent])
  
  # Generate all combinations of levels
  # Example:
  #     1    2    3
  # 1  1,1  1,2  1,3
  # 2  2,1  2,2  2,3
  # 3  3,1  3,2  3,3
  boot.combinations <- expand.grid(boot.levels, boot.levels)
  
  # Remove comparisons of a level to itself
  # Example: Remove 1,1  2,2  and 3,3
  boot.combinations <- boot.combinations[which(boot.combinations$Var1 != boot.combinations$Var2),]
  
  # Generate numeric columns from factor levels for easy manipulation
  boot.combinations$Num1 <- as.numeric(boot.combinations$Var1)
  boot.combinations$Num2 <- as.numeric(boot.combinations$Var2)
  # Use numeric values to remove duplicate, but reversed comparisons
  # Example: 1,2 and 2,1 are the same. Note that in the upper half of the diagonal matrix, the
  #          row number is always lower than the column number. This makes a simple filtering rule:
  #     1     2      3
  # 1  1,1  [1,2]  [1,3]
  # 2  2,1   2,2   [2,3]
  # 3  3,1   3,2    3,3
  boot.combinations <- boot.combinations[which(boot.combinations$Num1 < boot.combinations$Num2), ]
  # Remove numeric columns now that we're done filtering.
  boot.combinations <- boot.combinations %>% select(Var1, Var2)
  
  # Generate bootstraps for the pairs.
  boots <- data.frame()
  for (i in 1:nrow(boot.combinations)) {
    sample.a <- data[which(data[, independent] == boot.combinations$Var1[i]), ]
    sample.b <- data[which(data[, independent] == boot.combinations$Var2[i]), ]
    
    two.boot.temp <- two.boot(sample.a[, dependent], sample.b[, dependent], substitute(type), R = R.count)
    
    boots <- rbind(boots, t(two.boot.temp$t))
  }
  
  # Re-orient results and add column names
  boots <- as.data.frame(t(boots))
  colnames(boots) <- paste(boot.combinations$Var1, boot.combinations$Var2, sep = ".")
  
  return(boots)
}




## Bootstrap Plotting functions
plot.hist <- function(a, maxs, mins, cols = 'difference of means', nbins = 80, p = 0.05) {
  ## Description: Histogram plotting function copied from DS350 course code. 
  breaks = seq(maxs, mins, length.out = (nbins + 1))
  hist(a, breaks = breaks, main = paste('Histogram of', cols), xlab = cols)
  abline(v = mean(a), lwd = 4, col = 'red')
  abline(v = 0, lwd = 4, col = 'blue')
  abline(v = quantile(a, probs = p/2), lty = 3, col = 'red', lwd = 3)  
  abline(v = quantile(a, probs = (1 - p/2)), lty = 3, col = 'red', lwd = 3)
}






## Plot multiple histograms for bootstrap results on multiple levels
f_multi_hist <- function(data, simplify = FALSE, nbins = 80, p = 0.05,
                         plot.title = "Histogram", y.label = "mean log-price", x.label = "Level"){
  ## Description: Builds multiple histograms of bootstrap results for variables with multiple levels.
  ##              There is a "simplify" option to collapse histograms into simpler point-ranges for 
  ##              cases where too many levels (more than ~5) generate an overwhelming number of histograms.
  ##
  ## Arguments:
  ##    data: Bootstrap results
  ##    simplify: Generates point-ranges instead of histograms when TRUE
  ##    nbins: Number of histogram bins
  ##    p: Confidence interval threshold
  ##    plot.title: Plot title
  ##    y.label: X label
  ##    x.label: Y label
  
  # Get max and min values for plot limits
  maxs = max(data)
  mins = min(data)
  
  
  # Show histograms when simplify == FALSE
  if (simplify == FALSE) {
    # Use par to merge multiple plots into a single plot window. Use ncol(data) to make as many plots as there are levels.
    par(mfrow = c(ncol(data), 1))
    # Generate each individual plot and add to the plot layout initiated in the previous line.
    for (i in 1:ncol(data)) {
      plot.hist(data[, i], maxs, mins, cols = colnames(data)[i])
    }
    title(plot.title, outer = TRUE, cex.main = 2)
    # Reset plot window to 1x1 for future plots
    par(mfrow = c(1, 1))
  }
  
  
  # Condense into pointrange plots when simplify == TRUE
  if (simplify == TRUE) {
    # Calculate mean, upr, lwr values
    mean <- apply(data, 2, mean)
    upr <- apply(data, 2, function(x) quantile(x, probs = 1 - p/2))
    lwr <- apply(data, 2, function(x) quantile(x, probs = p/2))
    # Combine basic stats into a dataframe
    plot.data <- data.frame("name" = colnames(data), mean, upr, lwr)
    # Make name column a factor type, and order factor levels by their means for ordered plotting
    plot.data$name <- factor(plot.data$name, levels = plot.data$name[order(plot.data$mean)])
    
    # Generate pointrange plot of bootstrap results
    p.multi <- ggplot(plot.data, aes(x = name, y = mean, ymin = lwr, ymax = upr)) +
      geom_pointrange() +
      coord_flip() +
      geom_hline(yintercept = 0) +
      lims(y = c(min(plot.data$lwr), max(plot.data$upr))) +
      labs(title = plot.title, y = y.label, x = x.label)
    print(p.multi)
  }
}







######################################################################
##  Analyses
######################################################################


### Make

# Graphical Analysis
f_graph_analysis(data = auto.data, dependent = "log.price", "make")

# Remove unusual levels
make.data <- f_remove_levels(auto.data, "make", remove.levels = c("renault", "isuzu", "mercury", "chevrolet", "alfa-romero", "porsche", "jaguar"))

# ANOVA
f_anova_analyses(make.data, formula(log.price ~ make), suppress.diag.plot = FALSE)

# Bootstrap means
make.one.boot <- f_multilevel_one.boot(make.data, "make", "log.price")
f_multi_hist(make.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by make")

# Bootstrap mean differences
make.two.boot <- f_multilevel_two.boot(make.data, "make", "log.price")
f_multi_hist(make.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by make", y.label = "Mean log-price difference")







#### Body Style

# Graphical Comparison
f_graph_analysis(data = auto.data, dependent = "log.price", "body.style")

# Remove unusual levels
body.data <- f_remove_levels(auto.data, "body.style", remove.levels = c("convertible", "hardtop"))

# ANOVA
f_anova_analyses(body.data, formula(log.price ~ body.style), suppress.diag.plot = FALSE)

# Bootstrap means
body.one.boot <- f_multilevel_one.boot(body.data, "body.style", "log.price")
f_multi_hist(body.one.boot, plot.title = "Bootstrap means of log-price by body style")

# Bootstrap mean differences
body.two.boot <- f_multilevel_two.boot(body.data, "body.style", "log.price")
f_multi_hist(body.two.boot, plot.title = "Difference of bootstrap mean log-price by body style", y.label = "Mean log-price difference")






#### Drive Wheels

# Graphical
f_graph_analysis(data = auto.data, dependent = "log.price", "drive.wheels")

# Remove unusual levels
drive.data <- f_remove_levels(auto.data, "drive.wheels")

# ANOVA
f_anova_analyses(drive.data, formula(log.price ~ drive.wheels), suppress.diag.plot = FALSE)

# Bootstrap means
drive.one.boot <- f_multilevel_one.boot(drive.data, "drive.wheels", "log.price")
f_multi_hist(drive.one.boot, plot.title = "Bootstrap means of log-price by drive wheels")

# Bootstrap mean differences
drive.two.boot <- f_multilevel_two.boot(drive.data, "drive.wheels", "log.price")
f_multi_hist(drive.two.boot, plot.title = "Difference of bootstrap mean log-price by drive wheels", y.label = "Mean log-price difference")








#### Cylinder Count

# Compare variable levels graphically
f_graph_analysis(data = auto.data, dependent = "log.price", "num.of.cylinders")

# Remove unusual levels
cylinder.data <- f_remove_levels(auto.data, "num.of.cylinders", remove.levels = c("two", "three", "twelve"))

# ANOVA
f_anova_analyses(cylinder.data, formula(log.price ~ num.of.cylinders), suppress.diag.plot = FALSE)

# Bootstrap means
cylinder.one.boot <- f_multilevel_one.boot(cylinder.data, "num.of.cylinders", "log.price")
f_multi_hist(cylinder.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by cylinder count")

# Bootstrap mean differences
cylinder.two.boot <- f_multilevel_two.boot(cylinder.data, "num.of.cylinders", "log.price")
f_multi_hist(cylinder.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by cylinder count", y.label = "Mean log-price difference")











#### Engine Type

# Compare variable levels graphically
f_graph_analysis(data = auto.data, dependent = "log.price", "engine.type")

# Remove unusual levels
engine.type.data <- f_remove_levels(auto.data, "engine.type", remove.levels = c("dohcv", "rotor"))

# ANOVA
f_anova_analyses(engine.type.data, formula(log.price ~ engine.type), suppress.diag.plot = FALSE)

# Bootstrap means
engine.type.one.boot <- f_multilevel_one.boot(engine.type.data, "engine.type", "log.price")
f_multi_hist(engine.type.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by engine type")

# Bootstrap mean differences
engine.type.two.boot <- f_multilevel_two.boot(engine.type.data, "engine.type", "log.price")
f_multi_hist(engine.type.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by engine type", y.label = "Mean log-price difference")